1 About the Project

Crowdfunding, an alternative financing approach, involves raising funds for new business ventures via small amounts of capital from a large number of individuals. It is a relatively new phenomenon enabled by wide access to social media and internet-based financial technology (fintech) services. Compared with traditional banking and lending services, it makes obtaining funding more accessible for entrepreneurs and small businesses.

Little academic research has been conducted on crowdfunding, and there are many interesting areas for investigation. From a financial perspective, it is disrupting the small- and medium-enterprise (SME) lending market. Economically, it may be changing the prevalence and makeup of SMEs. In terms of marketing, it gives consumers a greater say in the products they would like to see available, but it also exposes them to increased risk. Regarding information and technology, it is enabling innovation on a public platform.

Our project would entail exploration of datasets regarding Indiegogo and Kickstarter projects. The primary project goal would be to construct a model that predicts crowdfunding success. In order to accomplish this, additional data sources may be required regarding consumer demand, small businesses, etc. From that predictive model, we seek to make recommendations (1) to entrepreneurs, regarding when and how to employ crowdfunding for project financing, and (2) to the lending services and venture capital industries, regarding how their business models should react.

From this independent study course, I expect to:

* Use multiple publicly-available crowdfunding datasets and R programming,
* Clean data and conduct primary research to obtain additional variables (as needed),
* Apply statistical analysis techniques to describe trends,
* Construct a model to predict crowdfunding success, and
* Prescribe best practices for entrepreneurs to leverage crowdfunding.

2 Key Findings

Kickstarter is AWESOME!

I can use GitHub!

3 Exploratory Analysis

3.1 $backers_count

Backers count is a very strong predictor of success. We can see from the logistic model that after ~75 backers, success is almost guaranteed. However, we are focusing on variables available at the beginning of a project.

df_backers = data.frame(funded = df_engr$funded, backers_count = df_engr$backers_count)
df_backers$df_bc_20 = cut(df_backers$backers_count, breaks = unique(quantile(df_backers$backers_count, seq(0, 1, by = .05))), include.lowest = TRUE)
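A side note on the binning above: `quantile()` can return duplicate break points when many projects share the same backer count (e.g. a long run of zeros), and `cut()` errors on non-unique breaks, which is why the breaks are wrapped in `unique()`. A minimal illustration with synthetic counts:

```r
# Synthetic backer counts: many zeros, so the low quantiles collapse to 0
x <- c(rep(0, 10), 1:10)

breaks_raw <- quantile(x, seq(0, 1, by = 0.25))
anyDuplicated(breaks_raw) > 0  # TRUE: the 0% and 25% quantiles are both 0

# cut(x, breaks = breaks_raw) would fail with "'breaks' are not unique";
# de-duplicating first gives a valid (if uneven) set of bins
bins <- cut(x, breaks = unique(breaks_raw), include.lowest = TRUE)
table(bins)
```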

df_backers %>% 
  ggplot(aes(x = df_bc_20, y = funded)) +
  stat_summary(geom = "bar", fun.y = "mean", fill = "#332288") +
    labs(
    title = "Funding By Backers",
    x="Number of Backers", 
    y="Chance of Funding") + 
  theme_minimal() +
  theme_update(axis.text.x = element_text(angle = 60, hjust = 1)) +
  theme_update(plot.title = element_text(color="#666666", face="bold", size=22, hjust=0)) +
  theme_update(axis.title = element_text(color="#666666", face="bold", size=18)) +
  theme(plot.title = element_text(hjust = 0.5))

#Clean Up
rm(df_backers)
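The ~75-backer threshold mentioned above can be checked directly with a univariate logistic fit. A sketch on synthetic data (the real call would be `glm(funded ~ backers_count, data = df_engr, family = binomial)`; the data-generating coefficients below are assumptions for illustration, not the actual estimates):

```r
set.seed(42)

# Synthetic stand-in for df_engr: success probability rises with backer count
backers <- rpois(5000, lambda = 50)
p_true  <- plogis(-3 + 0.06 * backers)  # assumed relationship, illustration only
funded  <- rbinom(5000, 1, p_true)

logit_backers <- glm(funded ~ backers, family = binomial(link = "logit"))

# Predicted probability of funding at 75 backers
predict(logit_backers, newdata = data.frame(backers = 75), type = "response")
```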

3.2 Category variables

$category_parent_id, $category_slug

Some categories never fail in this dataset (only considered if n > 50):

* design/product design (1098 projects)
* film & video/documentary (2202 projects)
* film & video/shorts (3513 projects)
* games/tabletop games (1064 projects)

Most successful parent categories (only considered if n > 100):

* 7: 1,725 projects, 81.2% successful
* 11: 12,087 projects, 65.2% successful
* 14: 14,635 projects, 59.2% successful

Least successful parent categories (only considered if n > 100):

* 18: 8,725 projects, 39.7% successful
* 16: 1,638 projects, 46.2% successful
* 12: 4,125 projects, 46.9% successful

# By main category
# 15 unique categories

df_engr %>%
  group_by(category) %>%
  summarise(count = n(), 
            success_rate = round(mean(funded)*100, 2), 
            mean(goal, na.rm = TRUE), 
            min(percent_funded, na.rm = TRUE), 
            median(percent_funded, na.rm = TRUE), 
            max(percent_funded, na.rm = TRUE),
            mean(percent_funded, na.rm = TRUE), 
            mean(backers_count), 
            min(backers_count), 
            median(backers_count), 
            max(backers_count), 
            mean(avg_contribution, na.rm = TRUE), 
            min(avg_contribution, na.rm = TRUE), 
            median(avg_contribution, na.rm = TRUE), 
            max(avg_contribution, na.rm = TRUE)) %>%
  arrange(desc(count)) %>%
  ungroup(category)
## # A tibble: 14 x 16
##    category  count success_rate `mean(goal, na.rm =… `min(percent_funded,…
##    <fct>     <int>        <dbl>                <dbl>                 <dbl>
##  1 music     14635         59.2                7589.                    0 
##  2 film&vid… 12087         65.2               22409.                    0 
##  3 publishi…  8725         39.7                7951.                    0 
##  4 art        6122         51.2                9139.                    0 
##  5 games      4125         46.9               38009.                    0 
##  6 design     1725         81.2               12981.                    0 
##  7 technolo…  1638         46.2               68119.                    0 
##  8 comics       72        100                 14186.                  100 
##  9 crafts       61         96.7                3235.                    0 
## 10 theater      57        100                  5994.                  100 
## 11 food         45        100                 19737.                  100.
## 12 fashion      36        100                 18178.                  101.
## 13 dance        11        100                  3165.                  102.
## 14 photogra…    11        100                  9061.                  101.
## # ... with 11 more variables: `median(percent_funded, na.rm =
## #   TRUE)` <dbl>, `max(percent_funded, na.rm = TRUE)` <dbl>,
## #   `mean(percent_funded, na.rm = TRUE)` <dbl>,
## #   `mean(backers_count)` <dbl>, `min(backers_count)` <dbl>,
## #   `median(backers_count)` <dbl>, `max(backers_count)` <dbl>,
## #   `mean(avg_contribution, na.rm = TRUE)` <dbl>, `min(avg_contribution,
## #   na.rm = TRUE)` <dbl>, `median(avg_contribution, na.rm = TRUE)` <dbl>,
## #   `max(avg_contribution, na.rm = TRUE)` <dbl>
#Logit
#Some categories significant
logit_category <- glm(funded ~ category, data = df_engr, family = binomial(link='logit'))
summary(logit_category)
## 
## Call:
## glm(formula = funded ~ category, family = binomial(link = "logit"), 
##     data = df_engr)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6145  -1.1975   0.9257   1.0244   1.3586  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           0.04705    0.02557   1.840 0.065732 .  
## categorycomics       14.51902  104.03231   0.140 0.889006    
## categorycrafts        3.33734    0.71945   4.639 3.51e-06 ***
## categorydance        14.51902  266.15714   0.055 0.956497    
## categorydesign        1.41715    0.06674  21.235  < 2e-16 ***
## categoryfashion      14.51902  147.12390   0.099 0.921388    
## categoryfilm&video    0.57870    0.03191  18.136  < 2e-16 ***
## categoryfood         14.51902  131.59162   0.110 0.912145    
## categorygames        -0.17085    0.04034  -4.235 2.28e-05 ***
## categorymusic         0.32408    0.03060  10.590  < 2e-16 ***
## categoryphotography  14.51902  266.15714   0.055 0.956497    
## categorypublishing   -0.46351    0.03365 -13.774  < 2e-16 ***
## categorytechnology   -0.19875    0.05577  -3.564 0.000365 ***
## categorytheater      14.51902  116.92222   0.124 0.901175    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 67757  on 49349  degrees of freedom
## Residual deviance: 65278  on 49336  degrees of freedom
## AIC: 65306
## 
## Number of Fisher Scoring iterations: 13
df_engr %>% 
  group_by(category) %>%
  summarise(count = n(), 
            funded_rate = round(mean(funded)*100, 2)) %>%
  arrange(desc(count)) %>%
  ggplot(aes(x = reorder(category, -count), y = count)) +
  geom_bar(stat="identity", fill = "#332288") +
  labs(
    title = "Frequency By Category",
    x="Project Category", 
    y="Number of Projects") + 
  theme_minimal() +
  theme_update(axis.text.x = element_text(angle = 60, hjust = 1))

df_engr %>% 
  group_by(category) %>%
  summarise(count = n(), 
            funded_rate = round(mean(funded)*100, 2)) %>%
  arrange(desc(count)) %>%
  ggplot(aes(x = reorder(category, -count), y = funded_rate)) +
  geom_bar(stat="identity", fill = "#332288") +
  labs(
    title = "Funding By Category",
    x="Project Category", 
    y="Chance of Funding") + 
  theme_minimal() +
  theme_update(axis.text.x = element_text(angle = 60, hjust = 1))

# By subcategory
# 140 unique category/subcategory combinations
#df_cats %>%
#  group_by(as.character(category_slug)) %>%
#  summarise(n(), 
#            success_rate = round(mean(funded)*100, 2), 
#            min(percent_funded, na.rm = TRUE), 
#            median(percent_funded, na.rm = TRUE), 
#            max(percent_funded, na.rm = TRUE),
#            mean(percent_funded, na.rm = TRUE), 
#            mean(backers_count), 
#            min(backers_count), 
#            median(backers_count), 
#            max(backers_count), 
#            mean(avg_contribution, na.rm = TRUE), 
#            min(avg_contribution, na.rm = TRUE), 
#            median(avg_contribution, na.rm = TRUE), 
#            max(avg_contribution, na.rm = TRUE)) %>%
#  ungroup(category_slug)



#Clean up
rm(logit_category)

3.3 Explore $comments_count

The vast majority of projects received fewer than 20 comments. The success rate seems to increase rapidly as more comments are received.

df_commments = data.frame(funded = df_engr$funded, 
                          comments_count = db_cleaned$comments_count)
df_commments$df_cc_20 = cut(df_commments$comments_count, 
                            breaks = unique(quantile(df_commments$comments_count, 
                                                     seq(0, 1, by = .05))), 
                            include.lowest = TRUE)

df_commments %>% 
  ggplot(aes(x = df_cc_20, y = funded)) +
  stat_summary(geom = "bar", fun.y = "mean", fill = "#332288") +
    labs(
    title = "Funding By Comments",
    x="Number of Comments", 
    y="Chance of Funding") + 
  theme_minimal() +
  theme_update(axis.text.x = element_text(angle = 60, hjust = 1))

#Logit
#significant
logit_comments <- glm(funded ~ comments_count, data = df_commments, family = binomial(link='logit'))
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(logit_comments)
## 
## Call:
## glm(formula = funded ~ comments_count, family = binomial(link = "logit"), 
##     data = df_commments)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -8.4904  -1.1927   0.6619   1.1351   1.1622  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    0.035838   0.009950   3.602 0.000316 ***
## comments_count 0.032261   0.001054  30.612  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 67757  on 49349  degrees of freedom
## Residual deviance: 64233  on 49348  degrees of freedom
## AIC: 64237
## 
## Number of Fisher Scoring iterations: 12
#Group into deciles and look as levels
#df_comments_count_filtered <- db_cleaned %>%
#  filter(comments_count <= 200) %>% 
#  mutate(comment_count_10 = cut(comments_count, 10))

#df_comments_count_filtered %>%
#  ggplot() + 
#  theme_minimal() + 
#  ggtitle("$comment_count in Deciles") + 
#  geom_histogram(aes(comment_count_10), stat = "count")

#df_comments_count_filtered %>%
#  group_by(comment_count_10) %>%
#  summarise(n(), 
#            mean(comments_count), 
#            success_rate = round(mean(funded)*100, 2), 
#            mean(backers_count), 
#            mean(avg_contribution, na.rm = TRUE)) %>%
#  ungroup(comment_count_10)

#Clean up
rm(logit_comments)
rm(df_commments)

3.4 $goal

Goal is likely a good predictor. The vast majority of goals are under $50,000. The likelihood of being funded decreases as goals get larger.

df_engr %>% 
  ggplot(aes(x = goal_20, y = funded)) +
  stat_summary(geom = "bar", fun.y = "mean", fill = "#332288") +
    labs(
    title = "Funding By Goal",
    x="Goal Amount", 
    y="Chance of Funding") + 
  theme_minimal() +
  theme_update(axis.text.x = element_text(angle = 60, hjust = 1))

#Linear Models
summary(glm(funded ~ goal, data = df_engr))
## 
## Call:
## glm(formula = funded ~ goal, data = df_engr)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.5591  -0.5585   0.4410   0.4413   1.3494  
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.591e-01  2.240e-03 249.579   <2e-16 ***
## goal        -8.887e-08  1.011e-08  -8.791   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.2463031)
## 
##     Null deviance: 12174  on 49349  degrees of freedom
## Residual deviance: 12155  on 49348  degrees of freedom
## AIC: 70904
## 
## Number of Fisher Scoring iterations: 2
summary(glm(funded ~ goal_20, data = df_engr))
## 
## Call:
## glm(formula = funded ~ goal_20, data = df_engr)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.7263  -0.5223   0.3081   0.4096   0.7480  
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               0.726339   0.007955  91.309  < 2e-16 ***
## goal_20(500,750]         -0.034405   0.015033  -2.289   0.0221 *  
## goal_20(750,1e+03]       -0.081382   0.011433  -7.118 1.11e-12 ***
## goal_20(1e+03,1.5e+03]   -0.068111   0.015761  -4.321 1.55e-05 ***
## goal_20(1.5e+03,1.8e+03] -0.053758   0.012555  -4.282 1.86e-05 ***
## goal_20(1.8e+03,2e+03]   -0.099882   0.012297  -8.123 4.67e-16 ***
## goal_20(2e+03,2.5e+03]   -0.103849   0.012195  -8.516  < 2e-16 ***
## goal_20(2.5e+03,3e+03]   -0.103880   0.011781  -8.818  < 2e-16 ***
## goal_20(3e+03,3.5e+03]   -0.108168   0.014221  -7.606 2.87e-14 ***
## goal_20(3.5e+03,4.5e+03] -0.135934   0.012492 -10.881  < 2e-16 ***
## goal_20(4.5e+03,5e+03]   -0.204050   0.010732 -19.013  < 2e-16 ***
## goal_20(5e+03,5.2e+03]   -0.250929   0.062445  -4.018 5.87e-05 ***
## goal_20(5.2e+03,7e+03]   -0.184895   0.011706 -15.795  < 2e-16 ***
## goal_20(7e+03,8e+03]     -0.185059   0.013565 -13.642  < 2e-16 ***
## goal_20(8e+03,1e+04]     -0.241528   0.011069 -21.821  < 2e-16 ***
## goal_20(1e+04,1.2e+04]   -0.226823   0.017024 -13.324  < 2e-16 ***
## goal_20(1.2e+04,1.6e+04] -0.251951   0.012943 -19.467  < 2e-16 ***
## goal_20(1.6e+04,2.5e+04] -0.312792   0.011810 -26.485  < 2e-16 ***
## goal_20(2.5e+04,5e+04]   -0.366339   0.012601 -29.072  < 2e-16 ***
## goal_20(5e+04,2.15e+07]  -0.474328   0.013739 -34.525  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.2340022)
## 
##     Null deviance: 12174  on 49349  degrees of freedom
## Residual deviance: 11543  on 49330  degrees of freedom
## AIC: 68394
## 
## Number of Fisher Scoring iterations: 2

3.5 $launched_at

The number of projects increased exponentially from 2009 to 2012 and seems to be increasing more gradually after 2012. The dataset contains only partial data for December 2013.

df_engr %>%
  group_by(mo_yr_launched)  %>% 
  summarise(n(), 
            success_rate = round(mean(funded)*100, 2),
            mean(percent_funded, na.rm = TRUE), 
            mean (avg_contribution, na.rm = TRUE)) %>%
  ungroup(mo_yr_launched)
## # A tibble: 57 x 5
##    mo_yr_launched `n()` success_rate `mean(percent_fun… `mean(avg_contrib…
##    <chr>          <int>        <dbl>              <dbl>              <dbl>
##  1 01-2010          118         58.5               82.4               68.3
##  2 01-2011          660         62.3               88.5               70.5
##  3 01-2012         1227         53.6              128.                67.1
##  4 01-2013         1388         55.6              124.                64.7
##  5 02-2010          139         62.6               89.9               75.1
##  6 02-2011          578         62.1               85.7               76.4
##  7 02-2012         1494         58.4              324.                71.9
##  8 02-2013         1435         59.0              121.                69.9
##  9 03-2010          209         65.1               89.6               78.5
## 10 03-2011          477         66.0              103.                76.7
## # ... with 47 more rows
df_engr %>% 
  ggplot(aes(x = yr_launched, y = funded)) +
  geom_bar(stat="identity", fill = "#332288") +
    labs(
    title = "Project Launch By Year",
    x="Year", 
    y="Projects Launched") + 
  theme_minimal() +
  theme_update(axis.text.x = element_text(angle = 60, hjust = 1))

df_engr %>% 
  ggplot(aes(x = yr_launched, y = funded)) +
  stat_summary(geom = "bar", fun.y = "mean", fill = "#332288") +
    labs(
    title = "Funding By Year",
    x="Year", 
    y="Chance of Funding") + 
  theme_minimal() +
  theme_update(axis.text.x = element_text(angle = 60, hjust = 1))

df_engr %>% 
  ggplot(aes(x = mo_launched, y = funded)) +
  stat_summary(geom = "bar", fun.y = "mean", fill = "#332288") +
    labs(
    title = "Funding By Launch Month",
    x="Launch Month", 
    y="Chance of Funding") + 
  theme_minimal() +
  theme_update(axis.text.x = element_text(angle = 60, hjust = 1))

summary(glm(funded ~ mo_launched, data = df_engr))
## 
## Call:
## glm(formula = funded ~ mo_launched, data = df_engr)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.5938  -0.5554   0.4172   0.4446   0.4679  
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    0.562629   0.008523  66.012  < 2e-16 ***
## mo_launched02  0.031172   0.011843   2.632  0.00848 ** 
## mo_launched03  0.020148   0.011560   1.743  0.08137 .  
## mo_launched04  0.005669   0.011639   0.487  0.62622    
## mo_launched05 -0.023772   0.011199  -2.123  0.03378 *  
## mo_launched06 -0.006424   0.011226  -0.572  0.56719    
## mo_launched07 -0.011455   0.011280  -1.015  0.30989    
## mo_launched08 -0.020658   0.011352  -1.820  0.06880 .  
## mo_launched09 -0.007231   0.011434  -0.632  0.52716    
## mo_launched10 -0.002315   0.011176  -0.207  0.83588    
## mo_launched11 -0.012882   0.011260  -1.144  0.25261    
## mo_launched12 -0.030529   0.012873  -2.371  0.01772 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.2464786)
## 
##     Null deviance: 12174  on 49349  degrees of freedom
## Residual deviance: 12161  on 49338  degrees of freedom
## AIC: 70950
## 
## Number of Fisher Scoring iterations: 2

3.6 location variables

3.6.1 $country

The majority of projects are based in the United States. Domestic projects have a success rate about 7 percentage points higher than international projects.

df_country <- data.frame(funded = df_engr$funded, country = db_cleaned$country, usa = df_engr$usa)

df_country %>%
  group_by(country) %>%
    summarise(count = n(), funded_rate = mean(funded)) %>%
  arrange(desc(count)) %>%
  ungroup(country)
## # A tibble: 5 x 3
##   country count funded_rate
##   <fct>   <int>       <dbl>
## 1 US      47007       0.561
## 2 GB       1986       0.492
## 3 CA        278       0.464
## 4 AU         54       0.556
## 5 NZ         25       0.52
summary(lm(funded ~ usa, data = df_country))
## 
## Call:
## lm(formula = funded ~ usa, data = df_country)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.5610 -0.5610  0.4390  0.4390  0.5096 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.49040    0.01026  47.814  < 2e-16 ***
## usa          0.07058    0.01051   6.717 1.88e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4965 on 49348 degrees of freedom
## Multiple R-squared:  0.0009133,  Adjusted R-squared:  0.0008931 
## F-statistic: 45.11 on 1 and 49348 DF,  p-value: 1.88e-11
#Clean up
rm(df_country)

3.7 $photo_key

There are not enough zero values for prediction, and a t-test shows no significant difference between having and not having a photo. This is probably not because photos don’t matter, but rather because the sample with no photo is too small to have statistical power.

df_engr %>%
  group_by(photo_key) %>%
    summarise(n(), mean(funded)) %>%
  ungroup(photo_key)
## # A tibble: 2 x 3
##   photo_key `n()` `mean(funded)`
##       <dbl> <int>          <dbl>
## 1         0    25          0.68 
## 2         1 49325          0.558
t.test(funded ~ photo_key, data = df_engr)
## 
##  Welch Two Sample t-test
## 
## data:  funded by photo_key
## t = 1.2854, df = 24.026, p-value = 0.2109
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.07413234  0.31899802
## sample estimates:
## mean in group 0 mean in group 1 
##       0.6800000       0.5575672
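The small-sample explanation can be quantified with a quick power calculation: with only 25 no-photo projects, the test has little chance of detecting a difference of the size observed. A sketch using base R's `power.t.test()`, treating it as a balanced two-sample test for simplicity and approximating the 0/1 outcome's standard deviation as 0.5:

```r
# Observed difference in funding rates was about 0.12 (0.68 vs. 0.558);
# a binary outcome with p near 0.5 has sd close to 0.5
pw <- power.t.test(n = 25, delta = 0.12, sd = 0.5, sig.level = 0.05)
pw$power  # well below the conventional 0.8 target
```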

3.8 $pledged

The higher the pledged amount, the more backers there are. The average contribution per backer does not seem to change much even as the amount pledged increases.

# # Group into deciles and look as levels
# df_filtered_by_pledged <- db_cleaned %>%
#   filter(pledged < 200000) %>%
#   mutate(pledged_10 = cut(pledged, 10))
# 
# #df_filtered_by_pledged %>%
#   group_by(pledged_10) %>%
#   summarise(n(), 
#             mean(pledged), 
#             mean(backers_count), 
#             mean(avg_contribution, na.rm = TRUE)) %>%
#   ungroup(pledged_10)
# 
# df_filtered_by_pledged %>%
#   ggplot() + 
#   theme_minimal() + 
#   ggtitle("$pledged in Deciles") + 
#   geom_histogram(aes(pledged_10), stat = "count")
# 
# #Clean up
# rm(df_filtered_by_pledged)

3.9 $spotlight

Spotlight is an ex post facto variable that features funded projects (see Kickstarter Spotlight).

df_engr %>%
  group_by(spotlight) %>%
    summarise(n(), mean(funded)) %>%
  ungroup(spotlight)
## # A tibble: 2 x 3
##   spotlight `n()` `mean(funded)`
##       <dbl> <int>          <dbl>
## 1         0 21831              0
## 2         1 27519              1
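Because spotlight perfectly separates funded from unfunded projects, it cannot be used as a predictor: a logistic regression on a perfectly separating variable produces degenerate, runaway coefficient estimates. A small demonstration on synthetic data:

```r
# Outcome and predictor agree exactly, mimicking spotlight vs. funded
y <- rep(c(0, 1), each = 100)
x <- y

# glm warns ("fitted probabilities numerically 0 or 1 occurred") and the
# slope estimate blows up instead of converging to a meaningful value
m <- suppressWarnings(glm(y ~ x, family = binomial(link = "logit")))
coef(m)["x"]  # extremely large, with a huge standard error
```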

3.10 $staff_pick

This flag denotes projects that have received the “Projects We Love” badge and are prominently featured on the website, in newsletters, and on blogs. Kickstarter staff clearly have a great eye for promising projects and/or a strong marketing impact. Kickstarter - Projects We Love

df_engr %>%
  group_by(staff_pick) %>%
    summarise(n(), mean(funded)) %>% 
  ungroup(staff_pick)
## # A tibble: 2 x 3
##   staff_pick `n()` `mean(funded)`
##        <dbl> <int>          <dbl>
## 1          0 44081          0.524
## 2          1  5269          0.842

3.11 $funded

On average, projects that are funded far exceed their goal amount; the average contribution per backer is higher and there are more backers; and the goal amount tends to be significantly lower.

df_engr %>%
  group_by(funded) %>%
  summarise(n(),
            mean(percent_funded, na.rm = TRUE), 
            mean(avg_contribution, na.rm = TRUE), 
            mean(goal, na.rm = TRUE),
            mean(backers_count), 
            median(backers_count), 
            max(backers_count), 
            mean(avg_contribution, na.rm = TRUE)) %>%
  ungroup(funded)
## # A tibble: 2 x 8
##   funded `n()` `mean(percent_funde… `mean(avg_contribut… `mean(goal, na.r…
##    <dbl> <int>                <dbl>                <dbl>             <dbl>
## 1      0 21831                 10.6                 57.2            26047.
## 2      1 27519                577.                  82.9             8453.
## # ... with 3 more variables: `mean(backers_count)` <dbl>,
## #   `median(backers_count)` <int>, `max(backers_count)` <dbl>

3.12 $video_status

A t-test shows that having a video has a statistically significant impact on the success of a project.

df_engr %>%
  group_by(video_status) %>%
    summarise(n(), mean(funded)) %>%
  ungroup(video_status)
## # A tibble: 2 x 3
##   video_status `n()` `mean(funded)`
##          <dbl> <int>          <dbl>
## 1            0  8882          0.408
## 2            1 40468          0.591
t.test(funded ~ video_status, data = df_engr)
## 
##  Welch Two Sample t-test
## 
## data:  funded by video_status
## t = -31.728, df = 13074, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.1940137 -0.1714361
## sample estimates:
## mean in group 0 mean in group 1 
##        0.407791        0.590516

3.13 Social media connectedness variables

Social media shows an impact. Facebook seems to be the strongest, and YouTube has a negative coefficient. Our hypothesis is that Facebook and Twitter may be used for promotion, while creators focusing on YouTube may over-rely on their product content. Yet the most successful creators have all three, which supports the idea that YouTube is effective when paired with a comprehensive social media campaign.

#facebook
df_engr %>%
  group_by(facebook) %>%
    summarise(n(), mean(funded)) %>%
  ungroup(facebook)
## # A tibble: 2 x 3
##   facebook `n()` `mean(funded)`
##      <dbl> <int>          <dbl>
## 1        0 36654          0.541
## 2        1 12696          0.605
#twitter
df_engr %>%
  group_by(twitter) %>%
    summarise(n(), mean(funded)) %>%
  ungroup(twitter)
## # A tibble: 2 x 3
##   twitter `n()` `mean(funded)`
##     <dbl> <int>          <dbl>
## 1       0 46741          0.554
## 2       1  2609          0.617
#youtube
df_engr %>%
  group_by(youtube) %>%
    summarise(n(), mean(funded)) %>%
  ungroup(youtube)
## # A tibble: 2 x 3
##   youtube `n()` `mean(funded)`
##     <dbl> <int>          <dbl>
## 1       0 45570          0.560
## 2       1  3780          0.528
#social_media
df_engr %>%
  group_by(social_media) %>%
    summarise(n(), mean(funded)) %>%
  ungroup(social_media)
## # A tibble: 2 x 3
##   social_media `n()` `mean(funded)`
##          <dbl> <int>          <dbl>
## 1            0 48771          0.557
## 2            1   579          0.615
lm_social_media <- lm(funded ~ social_media, data = df_engr)
summary(lm_social_media)
## 
## Call:
## lm(formula = funded ~ social_media, data = df_engr)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.6149 -0.5569  0.4430  0.4430  0.4430 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.556950   0.002249 247.660  < 2e-16 ***
## social_media 0.057903   0.020762   2.789  0.00529 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4966 on 49348 degrees of freedom
## Multiple R-squared:  0.0001576,  Adjusted R-squared:  0.0001373 
## F-statistic: 7.778 on 1 and 49348 DF,  p-value: 0.00529
#social_media_count
df_engr %>%
  group_by(social_media_count) %>%
    summarise(n(), mean(funded)) %>%
  ungroup(social_media_count)
## # A tibble: 4 x 3
##   social_media_count `n()` `mean(funded)`
##                <dbl> <int>          <dbl>
## 1                  0 34484          0.543
## 2                  1 11226          0.594
## 3                  2  3061          0.581
## 4                  3   579          0.615
lm_social_media_count <- lm(funded ~ social_media_count, data = df_engr)
summary(lm_social_media_count)
## 
## Call:
## lm(formula = funded ~ social_media_count, data = df_engr)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.6372 -0.5458  0.4237  0.4541  0.4541 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        0.545852   0.002592 210.568   <2e-16 ***
## social_media_count 0.030452   0.003400   8.956   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4963 on 49348 degrees of freedom
## Multiple R-squared:  0.001623,   Adjusted R-squared:  0.001602 
## F-statistic:  80.2 on 1 and 49348 DF,  p-value: < 2.2e-16
df_engr %>% 
  ggplot(aes(x = social_media_count, y = funded)) +
  stat_summary(geom = "bar", fun.y = "mean") +
  theme_minimal()

#Clean up
rm(lm_social_media_count, lm_social_media)

3.14 $campaign_duration

Interestingly, campaign duration has an inverse relationship with the likelihood of receiving funding; longer campaigns are associated with higher failure rates.

#Linear Model
#not significant
lm_length <- glm(percent_funded ~ campaign_duration, data = df_engr)
summary(lm_length)
## 
## Call:
## glm(formula = percent_funded ~ campaign_duration, data = df_engr)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
##    -497     -330     -250     -197  4153152  
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)  
## (Intercept)        506.246    273.642   1.850   0.0643 .
## campaign_duration   -5.074      7.139  -0.711   0.4773  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 530257105)
## 
##     Null deviance: 2.6167e+13  on 49349  degrees of freedom
## Residual deviance: 2.6167e+13  on 49348  degrees of freedom
## AIC: 1131439
## 
## Number of Fisher Scoring iterations: 2
#Logit
#significant
logit_length <- glm(funded ~ campaign_duration, data = df_engr, family=binomial(link='logit'))
summary(logit_length)
## 
## Call:
## glm(formula = funded ~ campaign_duration, family = binomial(link = "logit"), 
##     data = df_engr)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5155  -1.3146   0.9841   1.0450   1.4678  
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        0.7946662  0.0243907   32.58   <2e-16 ***
## campaign_duration -0.0158282  0.0006355  -24.91   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 67757  on 49349  degrees of freedom
## Residual deviance: 67121  on 49348  degrees of freedom
## AIC: 67125
## 
## Number of Fisher Scoring iterations: 4
df_engr %>%
  ggplot() +
  theme_minimal() + 
  ggtitle("$campaign_length Logistic Model") +
  geom_point(aes(x=campaign_duration, y=funded), colour = "black") +
  geom_point(aes(x=campaign_duration, y=logit_length$fitted.values), colour = "red")
## Don't know how to automatically pick scale for object of type difftime. Defaulting to continuous.
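The `difftime` warning above arises because `campaign_duration` was computed as a date difference and never converted to a plain number. Converting with `as.numeric()` (with explicit units) avoids the warning and makes the axis scale unambiguous; a minimal sketch, assuming the real column should be measured in days:

```r
launched <- as.POSIXct("2013-01-01", tz = "UTC")
deadline <- as.POSIXct("2013-01-31", tz = "UTC")

duration <- deadline - launched                    # a difftime object
duration_days <- as.numeric(duration, units = "days")
duration_days

# In the real data, the analogous fix would be:
# df_engr$campaign_duration <- as.numeric(df_engr$campaign_duration, units = "days")
```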

rm(lm_length, logit_length)

3.15 $avg_contribution

Average contribution appears to have a right-skewed distribution with a mean of about $72.

#Check for NA
anyNA(df_engr$avg_contribution)
## [1] TRUE
mean(df_engr$avg_contribution, na.rm = TRUE)
## [1] 72.70791
# Filter extreme outliers to inspect the distribution
df_filtered_by_avg_contribution <- df_engr %>%
  filter(avg_contribution < 1500)

# Box Plot
boxplot(df_filtered_by_avg_contribution$avg_contribution)

# Histogram
df_filtered_by_avg_contribution %>%
  ggplot() + 
  geom_histogram(aes(x = avg_contribution))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

#Clean up
rm(df_filtered_by_avg_contribution)

3.16 $percent_funded

The non-normal distribution shows that most projects that reach ~75% of their goal wind up being successful. There may be outside manipulation at play, due to Kickstarter promotion of projects that are near their goals, or personal donations by the creators and/or the creators’ personal connections.

# Check for NA
anyNA(df_engr$percent_funded)
## [1] FALSE
mean(df_engr$percent_funded, na.rm = TRUE)
## [1] 326.2655
df_filtered_by_percent_funded <- df_engr %>%
  filter(!is.na(percent_funded) & percent_funded <= 400)

# Box Plot
boxplot(df_filtered_by_percent_funded$percent_funded)

# Non-normal distribution
df_filtered_by_percent_funded %>% 
  ggplot() + 
  geom_histogram(aes(x = percent_funded))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

#Clean up
rm(df_filtered_by_percent_funded)

3.17 $description_length

df_engr %>% 
  ggplot(aes(x = description_length_10, y = funded)) +
  stat_summary(geom = "bar", fun.y = "mean", fill = "#332288") +
  labs(
    title = "Funding By Description",
    x = "Length of Description", 
    y = "Chance of Funding") + 
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

3.18 Profile variables

$profile_blurb, $profile_state

3.19 $rewards

df_engr %>% 
  ggplot(aes(x = reward_length_10, y = funded)) +
  stat_summary(geom = "bar", fun.y = "mean", fill = "#332288") +
  labs(
    title = "Funding By Rewards",
    x = "Length of Rewards", 
    y = "Chance of Funding") + 
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

3.20 $updates_count

3.21 Date variables

# Deadline differs from failed_at, successful_at, and state_changed_at by
# roughly one to two minutes on average.
# The difference is not meaningful, so remove failed_at, successful_at, state_changed_at
ticktock <- data.frame(db_cleaned$created_at, 
                       db_cleaned$deadline, 
                       db_cleaned$failed_at, 
                       db_cleaned$launched_at, 
                       db_cleaned$state_changed_at, 
                       db_cleaned$successful_at)

ticktock <- mutate(ticktock, 
                   deadline_failed = db_cleaned.deadline - db_cleaned.failed_at,
                   deadline_success = db_cleaned.deadline - db_cleaned.successful_at,
                   deadline_state = db_cleaned.deadline - db_cleaned.state_changed_at)

mean(ticktock$deadline_failed, na.rm = TRUE)
## Time difference of -52.16174 secs
mean(ticktock$deadline_success, na.rm = TRUE)
## Time difference of -112.5382 secs
mean(ticktock$deadline_state)
## Time difference of -85.82944 secs
rm(ticktock)

3.22 Not Fully Explored

More granular location variables would require more cleaning and may produce regional insights:

* location_name
* location_state
* location_type

3.23 Rejected

We examined the following variables but did not find them to be predictive:

* project_id
* disable_communication

##$project_id
#A random identifier, cannot easily observe a pattern
range(db_cleaned$project_id)
## [1]      21109 2147466649

4 Text Analysis

The variable full_description contains the complete project description from Kickstarter. Unstructured data such as this text variable require more cleaning and transformation to be useful, but can be a rich source of information. Our text analysis had three primary motives:

1. Examine word frequency with word counts and wordclouds
2. Construct topic models
3. Perform binary classification to predict project funding status

4.1 Examine Word Frequency

We began by transforming the strings of text in full_description into a data frame with one word per row. We then removed English stop words: common words that carry little semantic meaning and are thus immaterial to analysis (e.g., “and”, “the”, “of”). Finally, we determined word counts for:

* the entire dataset,
* only successful projects, and
* only failed projects.

# Most frequently used words in descriptions, overall
fd_text_tidy <- data_frame(id = db_cleaned$project_id, text = db_cleaned$full_description) %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words) %>%
  count(word, sort = TRUE)
## Joining, by = "word"
fd_text_tidy %>%
  filter(n > 25000) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip()

# Most frequently used words in successful descriptions
success_fd_text <- filter(db_cleaned, state == "successful")
success_fd_text <- data_frame(id = success_fd_text$project_id, text = success_fd_text$full_description)
success_fd_text_tidy <- success_fd_text %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words)
## Joining, by = "word"
#  count(word, sort = TRUE)

# Most frequently used words in failed descriptions
fail_fd_text <- filter(db_cleaned, state == "failed")
fail_fd_text <- data_frame(id = fail_fd_text$project_id, text = fail_fd_text$full_description)
fail_fd_text_tidy <- fail_fd_text %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words)
## Joining, by = "word"
#  count(word, sort = TRUE)

Next, we examined the correlation between word proportions of successful and failed project descriptions. Word proportion represents the percentage of time that a given word is used out of the total number of words in the document. In this case, the documents are the collection of all successful project descriptions and all failed project descriptions.

We observed, both visually and in terms of Pearson’s correlation coefficient, that the terms used in successful and failed project descriptions were very similar.
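The proportion-and-correlation logic can be sketched in base R with hypothetical counts (not taken from the corpus):

```r
# Hypothetical word counts in the two document collections
success_n <- c(game = 30, film = 50, music = 20)
failed_n  <- c(game = 20, film = 40, music = 40)

# Word proportion: each word's share of all words in its collection
prop_success <- success_n / sum(success_n)
prop_failed  <- failed_n  / sum(failed_n)

# Pearson correlation of the two proportion vectors; values near 1
# indicate very similar vocabulary usage
cor(prop_success, prop_failed)
```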

# Where is there word overlap between successful and failed projects?
frequency <- bind_rows(mutate(success_fd_text_tidy, funded = "successful"),
                       mutate(fail_fd_text_tidy, funded = "failed")) %>% 
  count(funded, word) %>%
  group_by(funded) %>%
  mutate(proportion = n/sum(n)) %>% 
  select(-n) %>% 
  spread(funded, proportion) %>% 
  gather(funded, proportion, `successful`)

ggplot(frequency, aes(x = proportion, y = `failed`, color = abs(`failed` - proportion))) +
  geom_abline(color = "gray40", lty = 2) +
  geom_jitter(alpha = 0.5, size = 1.5, width = 0.1, height = 0.1) +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 0.5) +
  scale_x_log10(labels = percent_format()) +
  scale_y_log10(labels = percent_format()) +
  scale_color_gradient(limits = c(0, 0.001), low = "darkslategray4", high = "gray75") +
  facet_wrap(~funded, ncol = 2) +
  theme(legend.position="none") +
  labs(y = "failed", x = "successful")
## Warning: Removed 601449 rows containing missing values (geom_point).
## Warning: Removed 601449 rows containing missing values (geom_text).

cor.test(data = frequency[frequency$funded == "successful",], ~ proportion + `failed`)
## 
##  Pearson's product-moment correlation
## 
## data:  proportion and failed
## t = 1167.9, df = 111150, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.961139 0.962025
## sample estimates:
##       cor 
## 0.9615845

Another way to visualize word frequency is with wordclouds, which scale the displayed size of each word according to its frequency relative to other words in the document. We constructed a wordcloud of the descriptions for the entire dataset and were not surprised to see that “project”, “kickstarter”, and “goal” were among the most frequent terms.

# Create a corpus
# Transform text to lowercase, remove punctuation, remove stop words
description_corpus <- VCorpus(VectorSource(db_cleaned$full_description))
description_corpus <- tm_map(description_corpus, content_transformer(tolower))
description_corpus <- tm_map(description_corpus, removePunctuation)
description_corpus <- tm_map(description_corpus, PlainTextDocument)
description_corpus <- tm_map(description_corpus, removeWords, stopwords('english'))

pal <- brewer.pal(9, "BuPu")
wordcloud(description_corpus, max.words = 100, random.order = FALSE, colors = pal, ordered.color = FALSE)

Wordclouds can be a useful way to visually observe differences in word variety and frequency between different groups of documents. Although they cannot be used in subsequent modeling, they are a tool for understanding unstructured text data and formulating hypotheses.

Therefore, we grouped our dataset into documents:

* by year, to identify trends over time, and
* by funded, to identify differences between successful and failed projects.

Prior to generating the wordclouds, we also created a custom set of stop words to weed out common terms in our dataset that could mask points of distinction between documents.

In the wordclouds by year, we see that music was the most prevalent category in 2009, but film emerged as the predominant category in 2010-2011. In 2012-2013, games appear as the biggest category. These wordclouds also hint at the variety of projects. From 2009-2011, the wordclouds become larger and word frequency is less concentrated around the same terms. Abruptly in 2012, projects seem to become less disparate, but in 2013 variety increases again. This suggests that the degree of project variety on Kickstarter may be cyclical, which seems logical as artists and entrepreneurs in the same field turn to Kickstarter after hearing about colleagues’ successes. However, more years of data are needed to verify the hypothesis of three-year periodicity.

# Words to exclude
exclude <- c("will", "can", "get", "make", "project", "time", "people", "kickstarter", "one", "goal", "money", "support", "help", "new", "like", "just", "first", "also", "like", "still", "really", "already", "ive", "weve", "dont", "well", "want", "need", "around", "include", "including", "two", "four", "three", "last", "thats", "youll", "currently", "others", "extra", "without", "within", "ever", "days", "months", "2012", "cant", "second", "100", "wanted", "fund", "funding", "able", "additional")

# Initialize year variables for `for` loop
Year <- as.numeric(as.character(format(db_cleaned$launched_at, format="%Y")))
YearUnique <- sort(unique(Year))

# Create corpora and wordclouds by year
for(i in seq_along(YearUnique)) {
  ind <- which(Year == YearUnique[i])
  if(length(ind) >= 10) {
    description_corpus <- VCorpus(VectorSource(db_cleaned$full_description[ind]))
    description_corpus <- tm_map(description_corpus, content_transformer(tolower))
    description_corpus <- tm_map(description_corpus, removePunctuation)
    description_corpus <- tm_map(description_corpus, PlainTextDocument)
    description_corpus <- tm_map(description_corpus, removeWords, c(exclude, stopwords('english')))
    
    print(YearUnique[i])
    pal <- brewer.pal(9, "BuPu")
    wordcloud(description_corpus, max.words = 100, random.order = FALSE, colors = pal, ordered.color = FALSE)
  }
}
## [1] 2009
## Warning in wordcloud(description_corpus, max.words = 100, random.order =
## FALSE, : process could not be fit on page. It will not be plotted.
## Warning in wordcloud(description_corpus, max.words = 100, random.order =
## FALSE, : name could not be fit on page. It will not be plotted.
## Warning in wordcloud(description_corpus, max.words = 100, random.order =
## FALSE, : everyone could not be fit on page. It will not be plotted.
## Warning in wordcloud(description_corpus, max.words = 100, random.order =
## FALSE, : creative could not be fit on page. It will not be plotted.
## Warning in wordcloud(description_corpus, max.words = 100, random.order =
## FALSE, : public could not be fit on page. It will not be plotted.
## Warning in wordcloud(description_corpus, max.words = 100, random.order =
## FALSE, : enough could not be fit on page. It will not be plotted.
## Warning in wordcloud(description_corpus, max.words = 100, random.order =
## FALSE, : festival could not be fit on page. It will not be plotted.
## Warning in wordcloud(description_corpus, max.words = 100, random.order =
## FALSE, : reach could not be fit on page. It will not be plotted.

## [1] 2010

## [1] 2011

## [1] 2012
## Warning in wordcloud(description_corpus, max.words = 100, random.order =
## FALSE, : experience could not be fit on page. It will not be plotted.
## Warning in wordcloud(description_corpus, max.words = 100, random.order =
## FALSE, : everyone could not be fit on page. It will not be plotted.
## Warning in wordcloud(description_corpus, max.words = 100, random.order =
## FALSE, : recording could not be fit on page. It will not be plotted.
## Warning in wordcloud(description_corpus, max.words = 100, random.order =
## FALSE, : receive could not be fit on page. It will not be plotted.
## Warning in wordcloud(description_corpus, max.words = 100, random.order =
## FALSE, : bring could not be fit on page. It will not be plotted.
## Warning in wordcloud(description_corpus, max.words = 100, random.order =
## FALSE, : band could not be fit on page. It will not be plotted.
## Warning in wordcloud(description_corpus, max.words = 100, random.order =
## FALSE, : family could not be fit on page. It will not be plotted.
## Warning in wordcloud(description_corpus, max.words = 100, random.order =
## FALSE, : stories could not be fit on page. It will not be plotted.
## Warning in wordcloud(description_corpus, max.words = 100, random.order =
## FALSE, : campaign could not be fit on page. It will not be plotted.
## Warning in wordcloud(description_corpus, max.words = 100, random.order =
## FALSE, : name could not be fit on page. It will not be plotted.
## Warning in wordcloud(description_corpus, max.words = 100, random.order =
## FALSE, : release could not be fit on page. It will not be plotted.
## Warning in wordcloud(description_corpus, max.words = 100, random.order =
## FALSE, : special could not be fit on page. It will not be plotted.
## Warning in wordcloud(description_corpus, max.words = 100, random.order =
## FALSE, : sound could not be fit on page. It will not be plotted.
## Warning in wordcloud(description_corpus, max.words = 100, random.order =
## FALSE, : unique could not be fit on page. It will not be plotted.
## Warning in wordcloud(description_corpus, max.words = 100, random.order =
## FALSE, : keep could not be fit on page. It will not be plotted.

## [1] 2013
## Warning in wordcloud(description_corpus, max.words = 100, random.order =
## FALSE, : community could not be fit on page. It will not be plotted.

## Warning in wordcloud(description_corpus, max.words = 100, random.order =
## FALSE, : keep could not be fit on page. It will not be plotted.

In the wordclouds by funding status, we observed a high degree of similarity in both terms and frequency between successful and failed projects. Books seemed more likely to fail, given the higher prevalence of “book” in the failed wordcloud. There also seemed to be more variety in the successful wordcloud, perhaps indicating richer project descriptions. Overall, however, raw word frequency may not be the best delineator of successful versus failed projects.

# Initialize success variables for `for` loop
Success <- db_cleaned$state
SuccessUnique <- sort(unique(Success))

# Create corpora and wordclouds by success
for(i in seq_along(SuccessUnique)) {
  ind <- which(Success == SuccessUnique[i])
  if(length(ind) >= 10) {
    description_corpus <- VCorpus(VectorSource(db_cleaned$full_description[ind]))
    description_corpus <- tm_map(description_corpus, content_transformer(tolower))
    description_corpus <- tm_map(description_corpus, removePunctuation)
    description_corpus <- tm_map(description_corpus, PlainTextDocument)
    description_corpus <- tm_map(description_corpus, removeWords, c(exclude, stopwords('english')))
    
    print(SuccessUnique[i])
    pal <- brewer.pal(9, "BuPu")
    wordcloud(description_corpus, max.words = 100, random.order = FALSE, colors = pal, ordered.color = FALSE)
  }
}
## [1] "failed"

## [1] "successful"

Sometimes the best way to determine points of difference between two similar documents is to examine the terms unique to each, rather than the most frequent terms. For example, two books written by the same author would likely generate similar wordclouds, yet the unique characters and places in each book would enable us to tell which book is which.

To see if this might be the case in our collection of successful and failed projects, we examined term frequency-inverse document frequency (tf-idf). Term frequency (tf) measures how often a term appears in a document; inverse document frequency (idf) down-weights terms that are common across the whole collection and up-weights terms concentrated in only a few documents. To remove nonsensical words from the analysis, we only considered words with a frequency greater than 500, a reasonably low cutoff in a dataset with 700,000+ unique terms.
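tf-idf can be made concrete with a small hand computation (toy terms, not taken from the corpus). With only two documents, a term appearing in both gets idf = log(2/2) = 0, while a term unique to one document gets idf = log(2/1):

```r
# Toy term counts for one "document" (all successful descriptions combined)
tf_success <- c(game = 4, dice = 2, fund = 4)

# tf: term frequency within the document
tf <- tf_success / sum(tf_success)

# idf: log(total documents / documents containing the term),
# with two documents (successful and failed) in the collection
n_docs <- 2
docs_containing <- c(game = 2, dice = 1, fund = 2)
idf <- log(n_docs / docs_containing)

# tf-idf: "game" and "fund" score 0 (they appear in both documents);
# "dice" gets a positive weight because it is unique to this document
round(tf * idf, 3)
##  game  dice  fund 
## 0.000 0.139 0.000
```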

The results of this analysis suggest that board games and film are likely to be successful (dice, unlocked, filmmaker(s), boards, filmmaking, premiere). However, although the games category overall had a high success rate, it appears that games involving war and violence were less likely to receive funding (weapon, battles, security, agent), as were online games (multiplayer, server, playable, modes, animations).

# Combine successful and failed documents in one data frame that considers document frequency and total frequency
funded_words <- bind_rows(mutate(success_fd_text_tidy, funded = "successful"),
                          mutate(fail_fd_text_tidy, funded = "failed")) %>%
  count(funded, word, sort = TRUE) %>%
  filter(n > 500) %>%
  ungroup()

total_words <- funded_words %>% 
  group_by(funded) %>% 
  summarize(total = sum(n))

funded_words <- left_join(funded_words, total_words)
## Joining, by = "funded"
# Compute tf-idf
funded_words <- funded_words %>%
  bind_tf_idf(word, funded, n) %>%
  select(-total) %>%
  arrange(desc(tf_idf))

# Visualize
funded_words %>%
  arrange(desc(tf_idf)) %>%
  mutate(word = factor(word, levels = rev(unique(word)))) %>% 
  group_by(funded) %>% 
  top_n(20) %>% 
  ungroup %>%
  ggplot(aes(word, tf_idf, fill = funded)) +
  geom_col(show.legend = FALSE) +
  labs(x = NULL, y = "tf-idf") +
  facet_wrap(~funded, ncol = 2, scales = "free") +
  coord_flip()
## Selecting by tf_idf

4.2 Topic Modeling

The analyses in the previous section have focused on the “bag-of-words” approach and word frequency as a method for natural language processing, the means by which computers make sense of human language. Although this is a common and useful approach, there are other useful ways to describe text data.

One such method is topic modeling. Topic models assume that words or groups of words (called n-grams) that frequently appear together in a dataset are explained by underlying, unobserved groups (called topics). By examining word and n-gram overlap across the documents in a dataset, these topics can be detected. Although the computer cannot provide a semantic label for a topic, a human familiar with the dataset can examine its top words and determine a theme.

4.3 Latent Dirichlet allocation

We chose Latent Dirichlet allocation (LDA) as our statistical model for topic detection. LDA examines text by word frequency and co-occurrence in documents, which in our case are individual project descriptions. LDA assumes that each document covers a small number of topics and that each topic uses a small set of words frequently, which makes it well suited to assigning documents to topics.

To feed data into the model, we first processed the text to transform it to lowercase, remove punctuation, and remove stop words. In this section, we also performed word stemming, which groups together words that share the same root but have different suffixes. This process helps ensure that words with the same semantic meaning, but different verb conjugations and the like, are assessed as the same word. Consequently, some truncated word stems appear in the output below.

After processing the text, we used it to generate documents, a vocabulary of terms in the dataset, and metadata to construct the model. Consistent with our tf-idf analysis above, we only considered terms that appeared in at least 500 documents. We ran the model specifying both 20 and 40 topics. It did not reach convergence within 10 or 20 iterations; however, meaningful topics emerged after 20 iterations with 40 topics.

Visualizing the results of our topic model, we see some meaningful topics emerge, some centered on the mechanisms of the platform, and others identifying product categories or subcategories.

For example, Topic 1 could be labeled Funding Requests and includes terms like “help”, “money”, “donate”, “dollar”, “goal”, “buck”, and “reach”. Topic 7 is all about the rewards provided to backers if the project is successful: “pledge”, “reward”, “level”, “backer”, “goal”, and “ship.”

On the other hand, Topic 6 seems to describe a certain subcategory of Film & Video and could be labeled Web Series with terms like “show”, “series”, “episode”, “season”, “pilot”, “web”, “anime”, and “comedy.” Topic 24 seems to describe a subcategory of Publishing and could be labeled Periodicals with terms like “issue”, “magazine”, “media”, “interview”, “article”, “content”, “journal”, and “print.”

The theme of the projects is clear from some topics, although the type of project is not easily distinguished. For example, Topic 30 is about Fantasy, but could span several types of projects. The same is true of Topic 31 (Christianity), Topic 35 (Family), and Topic 39 (Outer Space).

We also visualized the correlations between the 40 topics. The green nodes indicate topics, and the dashed lines represent relatedness between topics; the length of a dashed line indicates the degree of similarity between two topics. Our topics are highly related to one another, both in the number of connections and in their distances.

# Describe topics
labelTopics(poliblogPrevFit, c(1:40))
## Topic 1 Top Words:
##       Highest Prob: back, one, said, never, like, day, know 
##       FREX: said, knew, walk, went, felt, came, laugh 
##       Lift: cri, dad, sat, smile, said, didn’t, knew 
##       Score: said, knew, never, man, walk, laugh, back 
## Topic 2 Top Words:
##       Highest Prob: life, stori, live, person, famili, peopl, chang 
##       FREX: struggl, life, mother, emot, fear, mental, suffer 
##       Lift: mental, drug, depress, abus, loss, suffer, violenc 
##       Score: stori, life, live, famili, mother, struggl, emot 
## Topic 3 Top Words:
##       Highest Prob: light, black, hand, water, room, red, door 
##       FREX: door, water, sun, red, moon, floor, ring 
##       Lift: sun, door, floor, float, moon, ring, leg 
##       Score: black, water, light, red, door, blue, fire 
## Topic 4 Top Words:
##       Highest Prob: school, learn, student, children, kid, educ, teach 
##       FREX: children, student, teach, school, kid, educ, teacher 
##       Lift: classroom, teach, teacher, children, student, school, taught 
##       Score: school, student, children, kid, educ, learn, teach 
## Topic 5 Top Words:
##       Highest Prob: print, design, color, imag, paper, will, poster 
##       FREX: print, paper, color, poster, size, imag, printer 
##       Lift: ink, handmad, postcard, printer, paper, print, color 
##       Score: print, color, design, poster, paper, imag, printer 
## Topic 6 Top Words:
##       Highest Prob: year, time, first, work, one, two, now 
##       FREX: year, month, ago, last, spent, began, hour 
##       Lift: ago, nine, sum, spent, year, began, undertak 
##       Score: year, time, month, ago, last, two, began 
## Topic 7 Top Words:
##       Highest Prob: get, just, thing, good, like, one, can 
##       FREX: guy, bad, good, nice, throw, serious, thing 
##       Lift: buck, ice, throw, nice, blind, weird, bad 
##       Score: get, just, thing, guy, good, lot, bad 
## Topic 8 Top Words:
##       Highest Prob: love, friend, thank, help, support, famili, amaz 
##       FREX: friend, love, thank, amaz, famili, grate, wonder 
##       Lift: grate, dear, sincer, generos, babi, fabul, friend 
##       Score: love, thank, friend, famili, support, help, share 
## Topic 9 Top Words:
##       Highest Prob: ship, add, level, will, pledg, stretch, goal 
##       FREX: stretch, add, ship, zombi, monster, box, level 
##       Lift: miniatur, zombi, dragon, stretch, creatur, monster, add 
##       Score: ship, stretch, pledg, add, zombi, level, miniatur 
## Topic 10 Top Words:
##       Highest Prob: book, publish, stori, write, read, novel, cover 
##       FREX: book, publish, novel, author, reader, ebook, read 
##       Lift: paperback, ebook, e-book, manuscript, self-publish, book, novel 
##       Score: book, publish, novel, stori, write, reader, author 
## Topic 11 Top Words:
##       Highest Prob: will, project, fund, use, anim, complet, also 
##       FREX: anim, fund, complet, main, project, will, graphic 
##       Lift: anim, portion, main, intend, outlin, copyright, background 
##       Score: will, anim, project, fund, use, video, complet 
## Topic 12 Top Words:
##       Highest Prob: dont, get, like, want, realli, ive, make 
##       FREX: that, dont, your, ive, ill, there, cant 
##       Lift: itll, youv, there, your, havent, that, theyr 
##       Score: ive, dont, that, weve, youll, realli, your 
## Topic 13 Top Words:
##       Highest Prob: can, team, work, experi, creat, develop, idea 
##       FREX: model, team, type, resourc, task, virtual, idea 
##       Lift: zero, task, tech, virtual, model, advantag, optim 
##       Score: team, model, develop, build, experi, idea, can 
## Topic 14 Top Words:
##       Highest Prob: record, album, song, studio, releas, master, mix 
##       FREX: album, record, song, master, vinyl, studio, mix 
##       Lift: album, vinyl, cds, duplic, record, song, acoust 
##       Score: album, record, song, studio, guitar, releas, cds 
## Topic 15 Top Words:
##       Highest Prob: will, backer, reward, updat, get, free, new 
##       FREX: backer, updat, free, list, post, note, vote 
##       Lift: vote, updat, faq, backer, preview, logo, scroll 
##       Score: backer, reward, updat, get, free, kickstart, will 
## Topic 16 Top Words:
##       Highest Prob: will, reward, digit, copi, receiv, special, includ 
##       FREX: digit, plus, sign, t-shirt, copi, exclus, special 
##       Lift: sticker, autograph, t-shirt, plus, exclus, christma, holiday 
##       Score: copi, reward, digit, sign, t-shirt, plus, exclus 
## Topic 17 Top Words:
##       Highest Prob: art, artist, work, paint, will, creat, exhibit 
##       FREX: paint, art, exhibit, galleri, artist, canva, oil 
##       Lift: canva, galleri, paint, exhibit, oil, painter, art 
##       Score: art, paint, artist, exhibit, galleri, work, artwork 
## Topic 18 Top Words:
##       Highest Prob: war, world, power, battl, fight, hero, system 
##       FREX: war, planet, enemi, battl, fight, attack, hero 
##       Lift: soldier, planet, destruct, threat, destroy, war, empir 
##       Score: war, enemi, battl, fight, weapon, attack, hero 
## Topic 19 Top Words:
##       Highest Prob: world, god, women, girl, young, stori, men 
##       FREX: god, women, men, girl, woman, faith, mysteri 
##       Lift: jesus, god, religion, sexual, women, sex, husband 
##       Score: women, god, girl, woman, men, young, stori 
## Topic 20 Top Words:
##       Highest Prob: issu, magazin, inform, research, social, public, media 
##       FREX: magazin, issu, research, polit, articl, social, journal 
##       Lift: magazin, articl, issu, journal, journalist, confer, polit 
##       Score: magazin, issu, research, social, public, media, inform 
## Topic 21 Top Words:
##       Highest Prob: help, need, money, make, get, cost, rais 
##       FREX: money, need, pay, help, rais, donat, dollar 
##       Lift: money, cash, pay, paid, dollar, afford, expens 
##       Score: money, help, need, get, donat, rais, make 
## Topic 22 Top Words:
##       Highest Prob: new, john, david, michael, mark, jame, chris 
##       FREX: john, paul, jame, michael, joe, david, robert 
##       Lift: paul, stephen, charl, sam, john, jone, georg 
##       Score: john, david, michael, jame, paul, chris, mike 
## Topic 23 Top Words:
##       Highest Prob: music, perform, musician, danc, concert, compos, sound 
##       FREX: concert, jazz, perform, danc, compos, piano, musician 
##       Lift: orchestra, jazz, ensembl, piano, dancer, concert, opera 
##       Score: music, perform, concert, musician, jazz, danc, compos 
## Topic 24 Top Words:
##       Highest Prob: american, documentari, countri, travel, nation, cultur, state 
##       FREX: american, america, documentari, river, nation, south, countri 
##       Lift: africa, african, mexico, river, western, america, valley 
##       Score: documentari, american, cultur, nation, america, countri, interview 
## Topic 25 Top Words:
##       Highest Prob: music, band, play, video, fan, new, rock 
##       FREX: band, rock, music, fan, radio, tour, label 
##       Lift: punk, band, gig, nashvill, rock, radio, label 
##       Score: music, band, tour, rock, play, fan, radio 
## Topic 26 Top Words:
##       Highest Prob: communiti, event, local, citi, will, public, organ 
##       FREX: communiti, event, park, local, particip, organ, workshop 
##       Lift: neighborhood, non-profit, urban, workshop, downtown, communiti, resid 
##       Score: communiti, event, local, citi, public, park, organ 
## Topic 27 Top Words:
##       Highest Prob: citi, tour, travel, will, new, road, food 
##       FREX: san, road, francisco, juli, august, summer, coast 
##       Lift: portland, hotel, oregon, francisco, beach, coast, san 
##       Score: tour, citi, san, travel, road, food, trip 
## Topic 28 Top Words:
##       Highest Prob: project, will, support, share, inspir, dream, creat 
##       FREX: inspir, dream, creativ, passion, vision, share, financi 
##       Lift: meaning, endeavor, aspir, gratitud, vision, fruition, strive 
##       Score: inspir, dream, project, share, support, creativ, creat 
## Topic 29 Top Words:
##       Highest Prob: film, festiv, produc, director, work, product, featur 
##       FREX: director, festiv, theater, los, theatr, award, angel 
##       Lift: hollywood, theater, theatr, theatric, director, actress, festiv 
##       Score: film, festiv, director, filmmak, produc, los, product 
## Topic 30 Top Words:
##       Highest Prob: film, movi, crew, shoot, product, cast, short 
##       FREX: crew, movi, shoot, cast, film, camera, prop 
##       Lift: wardrob, makeup, prop, crew, costum, shoot, movi 
##       Score: film, crew, movi, cast, shoot, camera, actor 
## Topic 31 Top Words:
##       Highest Prob: will, want, can, like, look, idea, one 
##       FREX: idea, feel, pictur, word, someth, look, differ 
##       Lift: pen, coffe, letter, mine, quot, pictur, suggest 
##       Score: idea, want, someth, word, pictur, feel, pen 
## Topic 32 Top Words:
##       Highest Prob: design, product, use, manufactur, prototyp, make, machin 
##       FREX: machin, manufactur, prototyp, robot, assembl, mold, mount 
##       Lift: factori, mold, machin, durabl, laser, mount, robot 
##       Score: design, prototyp, product, manufactur, machin, robot, mold 
## Topic 33 Top Words:
##       Highest Prob: goal, kickstart, will, pledg, reach, fund, project 
##       FREX: reach, amount, facebook, goal, kickstart, link, exceed 
##       Lift: --noth, deadlin, exceed, incent, payment, twitter, surpass 
##       Score: goal, pledg, kickstart, reach, pleas, amount, fund 
## Topic 34 Top Words:
##       Highest Prob: will, adventur, island, fli, race, world, bird 
##       FREX: island, fli, bird, adventur, race, vehicl, sport 
##       Lift: island, flight, pet, bird, cat, fli, vehicl 
##       Score: adventur, island, race, fli, bird, pet, will 
## Topic 35 Top Words:
##       Highest Prob: show, seri, video, episod, season, web, produc 
##       FREX: episod, season, seri, web, show, youtub, pilot 
##       Lift: episod, season, podcast, beer, cartoon, pilot, youtub 
##       Score: episod, seri, show, season, video, web, pilot 
## Topic 36 Top Words:
##       Highest Prob: card, game, play, map, player, deck, rule 
##       FREX: card, deck, map, rule, dice, pdf, player 
##       Lift: deck, dice, card, map, rule, pdf, knight 
##       Score: card, game, deck, dice, player, map, play 
## Topic 37 Top Words:
##       Highest Prob: game, player, play, will, develop, charact, level 
##       FREX: game, puzzl, gameplay, player, mode, gamer, beta 
##       Lift: -game, puzzl, multiplay, gameplay, playabl, consol, game 
##       Score: game, player, gameplay, play, develop, charact, gamer 
## Topic 38 Top Words:
##       Highest Prob: use, can, develop, app, control, system, devic 
##       FREX: app, devic, softwar, user, data, phone, iphon 
##       Lift: app, data, monitor, applic, hardwar, iphon, devic 
##       Score: app, devic, user, softwar, control, system, data 
## Topic 39 Top Words:
##       Highest Prob: want, can, make, peopl, know, ’ve, ’re 
##       FREX: ’ve, ’re, want, ’ll, know, peopl, don’t 
##       Lift: ’ve, ’re, won’t, ’ll, can’t, don’t, isn’t 
##       Score: ’ve, ’re, ’ll, want, don’t, peopl, know 
## Topic 40 Top Words:
##       Highest Prob: will, creat, natur, piec, materi, project, space 
##       FREX: sculptur, natur, instal, burn, tree, wood, structur 
##       Lift: sculptur, burn, tree, structur, architectur, symbol, seed 
##       Score: sculptur, instal, piec, natur, space, materi, tree
# Plot expected topic proportions across the corpus
plot(poliblogPrevFit, type = "summary", xlim = c(0, .06), text.cex = 0.5)

# Draw a word cloud for each of the 40 fitted topics
for (i in 1:40) {
  cloud(poliblogPrevFit, topic = i, scale = c(3, 1))
}
## Warning in wordcloud::wordcloud(words = vocab, freq = vec, max.words =
## max.words, : alway could not be fit on page. It will not be plotted.

## Warning in wordcloud::wordcloud(words = vocab, freq = vec, max.words =
## max.words, : black could not be fit on page. It will not be plotted.

## Warning in wordcloud::wordcloud(words = vocab, freq = vec, max.words =
## max.words, : war could not be fit on page. It will not be plotted.

## Warning in wordcloud::wordcloud(words = vocab, freq = vec, max.words =
## max.words, : public could not be fit on page. It will not be plotted.

## Warning in wordcloud::wordcloud(words = vocab, freq = vec, max.words =
## max.words, : nation could not be fit on page. It will not be plotted.

# Topic correlations
mod.out.corr <- topicCorr(poliblogPrevFit)
plot(mod.out.corr)

In natural language processing, text data often arrive with little metadata for categorizing the documents. Although our dataset includes each project's category, we have no mechanism aside from text mining to identify project themes, which may be strongly related to success or failure. The results of the topic model could therefore be useful features for classifying successful and unsuccessful projects.
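One way this classification idea could be sketched: the fitted `stm` object exposes a document-by-topic proportion matrix (`theta`), which can serve as predictors in a standard classifier. The snippet below is a minimal illustration, assuming `poliblogPrevFit` is the fitted model from above and that `success` is a hypothetical 0/1 vector aligned row-for-row with the modeled documents (it is not constructed in this section).

```r
# Sketch: topic proportions as classification features.
# Assumes `poliblogPrevFit` is the stm fit above and `success` is a
# hypothetical binary outcome vector aligned with the modeled documents.
theta <- poliblogPrevFit$theta                  # documents x topics matrix
colnames(theta) <- paste0("topic", seq_len(ncol(theta)))

# Drop one topic column: rows of theta sum to 1, so keeping all 40
# would introduce perfect collinearity.
clf_data <- data.frame(success = success, theta[, -1])

# Logistic regression of project success on topic proportions
fit <- glm(success ~ ., data = clf_data, family = binomial)
summary(fit)
```

Any classifier (e.g., naive Bayes or a tree-based model) could be substituted for `glm`; logistic regression is used here only because it requires no additional packages.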

4.4 Acknowledgements

The following resources were invaluable to the completion of this section:

* Text Mining with R: A Tidy Approach (Silge & Robinson, 2018; https://www.tidytextmining.com)
* Class notes from Prof. Ujjal Mukherjee (University of Illinois at Urbana-Champaign, Gies College of Business)
* stm: R Package for Structural Topic Models (Roberts, Stewart, & Tingley; https://cran.r-project.org/web/packages/stm/vignettes/stmVignette.pdf)
* Binary text classification with Tidytext and caret (Hvitfeldt, 2018; https://www.hvitfeldt.me/2018/03/binary-text-classification-with-tidytext-and-caret/)
* naivebayes package documentation (ftp://cran.r-project.org/pub/R/web/packages/naivebayes/naivebayes.pdf)